Length-Incremental Phrase Training for SMT
Abstract
We present an iterative technique for generating phrase tables for SMT, based on force-aligning the training data with a modified translation decoder. Unlike previous work, we completely avoid word alignment and phrase extraction heuristics, moving towards more principled phrase generation and probability estimation. During training, we allow the decoder to generate new phrases on the fly and increment the maximum phrase length in each iteration. Experiments are carried out on the IWSLT 2011 Arabic-English task, where our training method achieves moderate improvements over a state-of-the-art baseline. The resulting phrase table shows only a small overlap with the heuristically extracted one, which demonstrates how restrictive it is to limit phrase selection by word alignment or heuristics. By interpolating the heuristic and the trained phrase table, we can improve over the baseline by 0.5% BLEU and 0.5% TER.
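The final step of the abstract, interpolating the heuristic and the trained phrase table, can be illustrated with a minimal sketch. The abstract does not say which interpolation scheme is used, so the code below assumes a plain linear interpolation with a single weight. The function name `interpolate_phrase_tables`, the `weight` parameter, the dictionary representation, and the toy entries are all hypothetical and chosen only for illustration.

```python
def interpolate_phrase_tables(heuristic, trained, weight=0.5):
    """Linearly interpolate two phrase tables (assumed scheme, not from the paper).

    Each table maps (source_phrase, target_phrase) -> translation probability.
    A phrase pair missing from one table contributes 0.0 from that table.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    combined = {}
    # Take the union of phrase pairs so that pairs found by only one
    # method are kept in the combined table.
    for pair in heuristic.keys() | trained.keys():
        combined[pair] = (weight * heuristic.get(pair, 0.0)
                          + (1.0 - weight) * trained.get(pair, 0.0))
    return combined


# Toy data, invented for illustration only.
heuristic_table = {("kitab", "book"): 0.8, ("kitab", "a book"): 0.2}
trained_table = {("kitab", "book"): 0.6, ("al-kitab", "the book"): 1.0}

combined_table = interpolate_phrase_tables(heuristic_table, trained_table)
for pair in sorted(combined_table):
    print(pair, round(combined_table[pair], 3))
```

Taking the union of the two tables matters here because the abstract reports only a small overlap between them: most phrase pairs would come from exactly one source and receive a down-weighted probability under this assumed scheme.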
Similar resources
Incrementally Updating the SMT Reordering Model
This work is concerned with incrementally training statistical machine translation (SMT) models when new data becomes available, in contrast to re-training new models on the entire accumulated data. Incremental training provides a way to perform faster, more frequent model updates, keeping the SMT system up to date with the most recent data. Specifically, we address increme...
Lexical Syntax for Statistical Machine Translation
Statistical Machine Translation (SMT) is by far the dominant paradigm of Machine Translation. This can be justified by many reasons, such as accuracy, scalability, computational efficiency and fast adaptation to new languages and domains. However, current approaches to phrase-based SMT lack the capability to produce more grammatical translations and handle long-range reordering whil...
Decoder-based Discriminative Training of Phrase Segmentation for Statistical Machine Translation
In this paper, we propose a new method of training a phrase segmentation model for phrase-based statistical machine translation (SMT). We define a good segmentation as one that produces a good translation. According to this definition, we propose a method that can discriminate between a good segmentation and a bad segmentation based on the translation quality. The proposed approach constru...
Incremental Re-training for Post-editing SMT
A method is presented for incremental retraining of an SMT system, in which a local phrase table is created and incrementally updated as a file is translated and post-edited. It is shown that translation data from within the same file has higher value than other domain-specific data. In two technical domains, within-file data increases BLEU score by several full points. Furthermore, a strong re...
Dynamically Integrating Cross-Domain Translation Memory into Phrase-Based Machine Translation during Decoding
Our previous work focuses on combining translation memory (TM) and statistical machine translation (SMT) when the TM database and the SMT training set are the same. However, in real tasks the TM database deviates from the SMT training set as time goes by. In this work, we concentrate on the case where the TM database and the SMT training set are different and even from different domains...